tidytof: a user-friendly framework for scalable and reproducible high-dimensional cytometry data analysis

Abstract Summary While many algorithms for analyzing high-dimensional cytometry data have now been developed, the software implementations of these algorithms remain highly customized—this means that exploring a dataset requires users to learn unique, often poorly interoperable package syntaxes for each step of data processing. To solve this problem, we developed {tidytof}, an open-source R package for analyzing high-dimensional cytometry data using the increasingly popular ‘tidy data’ interface. Availability and implementation {tidytof} is available at https://github.com/keyes-timothy/tidytof and is released under the MIT license. It is supported on Linux, MS Windows and MacOS. Additional documentation is available at the package website (https://keyes-timothy.github.io/tidytof/). Supplementary information Supplementary data are available at Bioinformatics Advances online.


Supplementary Tables
Supplementary Table 1 Proteomic data {tidytof} calculations Metadata cell_id protein_1 protein_2 protein_3 tsne_1 tsne_2 cluster sample_type patient {tidytof} represents high-dimensional cytometry data in a "tidy format" using an extended data frame called a "tof_tbl". In this format, data are represented such that each cell is given its own row and each measurement or piece of metadata is given its own column.  [12][13][14] {tidytof} performance benchmarking
Workflows were compared both on the basis of their computational speed and on the basis of their coding burden (i.e. the amount of code needed for each workflow). Specifically, comparisons included one metric for computational speed (total elapsed time for each workflow) and two metrics for coding burden associated with human error when conducting computational analyses (the number of lines of code and the number of intermediate variable assignments needed for the workflow). In addition, {tidytof}'s memory requirements were benchmarked against those of the {flowCore} package's flowSet objects -a commonly-used data structure for the analysis of high-dimensional cytometry data -as well as against {cytofkit} data.frames, {immunoCluster} SingleCellExperiments [20], and {Spectre} data.tables [21].
All benchmarking was performed on a Linux machine running Ubuntu 20.04 with an AMD Threadripper Pro 3995WX processor (64 cores; 2.70 GHz; 256 MB cache) and 256 GB of RAM. All functions used to perform speed, coding burden, and memory benchmarking are available here.

Benchmarking datasets
The data used for benchmarking were drawn from Good
The system memory used to store each object was then calculated using the obj_size() function from the {lobstr} package [23].

Coding burden comparison
For coding burden comparisons, the source code of the functions containing each workflow was subjected to text analysis. To control for differences in code style and practices between workflows, the {styler} package [24] was used to standardize all benchmarking code to adhere to the tidyverse R code style guide. The functions used to count the number of lines of code and the number of variable assignments in each workflow are provided here: # count the lines of code for a given workflow lines <-capture.output(print(function_object)) # substract 2 lines to omit the function definition and closing # bracket for each workflow return(length(lines) -2L) } } # count the number of assigned variables for a given workflow lines <-capture.output(print(function_object)) assignments <-max(sum(str_count(lines, pattern = "<-")), 1L) return(assignments) } } For both functions above, the input is a function (function_object) used to define a {tidytof}, base R, {flowCore}, {cytofkit}, {immunoCluster}, or {Spectre} workflow, and the output is an integer representing the number of lines of code or the number of variable assignments, respectively, contained in that workflow. The functions used to define each workflow are available on GitHub here.

Speed benchmarking (Supplementary Figure 1)
Across a wide variety of workflows including FCS and CSV file reading, single-cell data transformation (preprocessing), downsampling, FlowSOM clustering, sample-level feature extraction, tSNE embedding, and UMAP embedding, {tidytof}'s time efficiency rivals or improves upon alternative approaches (Supplementary Figure 1). In the case of PCA, {tidytof} incurs a slight performance decrease relative to alternative tools because it uses the {embed} package [25] to store a dataset's PCA embedding as a recipe object (from the {recipes} package) [26] -which, while costing a small amount of computational overhead, allows users to more easily apply the same embedding to new data points. While {Spectre}'s use of data.table makes it slightly faster than {tidytof} for simple arithmetic tasks like data preprocessing (Supplementary Figure 1C) and downsampling (Supplementary Figure 1D), {tidytof} requires substantially less cumulative time than competing methods when performing multiple workflows (Supplementary Figure 1J). shows the sum of runtimes from panels A-F for each tool (because panels G-I were run on different datasets, they could not be combined with the other panels). In all panels, points represent the median runtime for each workflow from 10 independent repetitions and error bars represent the interquartile range of runtimes.

Memory benchmarking (Supplementary Figure 2)
Despite the increased simplicity of {tidytof}'s tidy data structures (tof_tbl objects) compared to {flow-Core}'s native data structure (the flowSet), both data structures require similar amounts of system memory (Supplementary Figure 2). Furthermore, {tidytof} requires slightly less memory than {Spectre} data.tables and substantially less memory than {immunoCluster} SingleCellExperiments and {cytofkit} data.frames across all dataset sizes.

Coding burden comparison (Supplementary Figures 3-4)
In addition to its competitive computational speed relative to equivalent workflows using other cytometry data analysis software, {tidytof} reduces the coding burden of performing high-dimensional cytometry data analyses significantly compared to alternative frameworks. Across workflows, {tidytof} reduces the lines of code needed to perform an analysis between 33% and 95% compared to alternative software tools (Supplementary Figure 3). Similarly, {tidytof} reduces the number of variable assignments needed to perform an analysis between 67% to 98% compared alternative software tools (Supplementary Figure 4). Because code bases with both more lines of code and more intermediate variables are more prone to human (e.g copy-and-paste) error and reduced readability, these results support {tidytof}'s marked ability to simplify high-dimensional cytometry data analysis code and increase reproducibility among data analysis pipelines.     Intermediate variable assignments

Supplementary Notes Supplementary Note 1 -A beginner's introduction to {tidytof}
Analyzing single-cell data can be surprisingly complicated. One one hand, this is partially because single-cell data analysis is an incredibly active area of research, with new methods being published at a rapid rate. Accordingly, when new tools are shared with the scientific community, they often require researchers to learn unique, method-specific application programming interfaces (APIs) with distinct requirements for input data formatting, function syntax, and output data structure. Furthermore, analyzing single-cell data can be challenging because it often involves simultaneously asking questions at multiple levels of biological scopethe single-cell level, the cell subpopulation (i.e. cluster) level, and the whole-sample or whole-patient leveleach of which has distinct data processing needs. To address both of these challenges for high-dimensional cytometry [27], {tidytof} implements a concise, integrated "grammar" of single-cell data analysis capable of answering a variety of biological questions.
Available as an open-source R package, {tidytof} provides an easy-to-use pipeline for analyzing highdimensional cytometry data by automating many common data-processing tasks under a common "tidy data" interface. This vignette introduces you to the tidytof's high-level API and shows quick examples of how they can be applied to high-dimensional cytometry datasets.

Prerequisites
{tidytof} makes heavy use of two concepts that may be unfamiliar to R beginners. The first is the pipe (|>), which you can read about here. The second is "grouping" data in a data.frame or tibble using dplyr::group_by, which you can read about here. Most {tidytof} users will also benefit from a relatively in-depth understanding of the dplyr package, which has an introductory vignette here [28]:

vignette("dplyr")
Everything else should be self-explanatory for both beginner and advanced R users, though if you have zero background in running R code, you should read this chapter of R for Data Science by Hadley Wickham [29].

Workflow basics
Broadly speaking, {tidytof}'s functionality is organized to support the 3 levels of analysis inherent to single-cell data described above: 1. Reading, writing, preprocessing, and visualizing data at the level of individual cells 2. Identifying and describing cell subpopulations or clusters 3. Building models (for inference or prediction) at the level of patients or samples {tidytof} provides functions (or "verbs") that operate at each of these levels of analysis: • Cell-level data: -tof_read_data() reads single-cell data from FCS or CSV files on disk into a tidy data frame called a tof_tbl. tof_tbls represent each cell as a row and each protein measurement (or other piece of information associated with a given cell) as a column. -tof_preprocess() transforms protein expression values using a user-provided function (i.e. logtransformation, centering, scaling) -tof_downsample() reduces the number of cells in a tof_tibble via subsampling.
-tof_reduce_dimensions() performs dimensionality reduction (across columns) -tof_write_data writes single-cell data in a tof_tibble back to disk in the form of an FCS or CSV file.
• Cluster-level data: -tof_cluster() clusters cells using one of several algorithms commonly applied to high-dimensional cytometry data -tof_metacluster() agglomerates clusters into a smaller number of metaclusters -tof_daa() performs differential abundance analysis (DAA) for clusters or metaclusters across experimental groups -tof_dea() performs differential expression analysis (DEA) for clusters' or metaclusters' marker expression levels across experimental groups -tof_extract_features() computes summary statistics (such as mean marker expression) for each cluster. Also (optionally) pivots these summary statistics into a sample-level tidy data frame in which each row represents a sample and each column represents a cluster-level summary statistic. -tof_upsample() assigns new cells to previously-computed clusters based on the similarity between each new cell and the cells in each cluster.
• Sample-level data: -tof_split_data() splits sample-level data into a training and test set for predictive modeling -tof_create_grid() creates an elastic net hyperparameter search grid for model tuning -tof_train_model() trains a sample-level elastic net model and saves it as a tof_model object -tof_predict() Applies a trained tof_model to new data to predict sample-level outcomes -tof_assess_model() calculates performance metrics for a trained tof_model

{tidytof} verb syntax
With very few exceptions, {tidytof} functions follow a specific, shared syntax that involves 3 types of arguments that always occur in the same order. These argument types are as follows: 1. For almost all {tidytof} functions, the first argument is a data frame (or tibble). This enables the use of the pipe (|>) for multi-step calculations, which means that your first argument for most functions will be implicit (passed from the previous function using the pipe). This also means that most {tidytof} functions are so-called "single-table verbs," with the exception of tof_cluster_ddpr, which is a "twotable verb" (for details about how to use tof_cluster_ddpr, see the "clustering-and-metaclustering" vignette). 2. The second group of arguments are called column specifications, and they end in the suffix _col or _cols. Column specifications are unquoted column names that tell a {tidytof} verb which columns to compute over for a particular operation. For example, the cluster_cols argument in tof_cluster allows the user to specify which column in the input data frames should be used to perform the clustering. Regardless of which verb requires them, column specifications support tidyselect helpers [30]. 3. Finally, the third group of arguments for each {tidytof} verb are called method specifications, and they're comprised of every argument that isn't an input data frame or a column specification. Whereas column specifications represent which columns should be used to perform an operation, method specifications represent the details of how that operation should be performed. For example, the tof_cluster_phenograph() function requires the method specification num_neighbors, which specifies how many nearest neighbors should be used to construct the PhenoGraph algorithm's k-nearest-neighbor graph. In most cases, {tidytof} sets reasonable defaults for each verb's particular method specifications, but your workflows are can also be customized by experimenting with non-default values.

Other tips
{tidytof} was designed by a multidisciplinary team of wet-lab biologists, bioinformaticians, and physicianscientists who analyze high-dimensional cytometry and other kinds of single-cell data to solve a variety of problems. As a result, {tidytof}'s high-level API was designed with great care to mirror that of the {tidyverse} itself -that is, to be human-centered, consistent, composable, and inclusive for a wide userbase.
Practically speaking, this means a few things about using {tidytof}.
First, it means that {tidytof} was designed with a few quality-of-life features in mind. For example, you may notice that most {tidytof} functions begin with the prefix tof_. This is intentional, as it will allow you to use your development environment's code-completing software to search for {tidytof} functions easily (even if you can't remember a specific function name). For this reason, we recommend using {tidytof} within the RStudio development environment; however, many code editors have predictive text functionality that serves a similar function. In general, {tidytof} verbs are organized in such a way that your IDE's code-completion tools should also allow you to search for (and compare) related functions with relative ease.
(For instance, the tof_cluster_ prefix is used for all clustering functions, and the tof_downsample_ prefix is used for all downsampling functions).
Second, it means that {tidytof} functions should be relatively intuitive to use due to their shared logic -in other words, if you understand how to use one {tidytof} function, you should understand how to use most of the others. An example of shared logic across {tidytof} functions is the argument group_cols, which shows up in multiple verbs (tof_downsample, tof_cluster, tof_daa, tof_dea, tof_extract_features, and tof_write_data). In each case, group_cols works the same way: it accepts an unquoted vector of column names (specified manually or using tidyselection) that should be used to group cells before an operation is performed. This idea generalizes throughout {tidytof}: if you see an argument in one place, it will behave identically (or at least very similarly) wherever else you encounter it.
Finally, it means that {tidytof} is optimized first for ease-of-use, then for performance. Because humans and computers interact with data differently, there is always a trade-off between choosing a data representation that is intuitive to a human user vs. choosing a data representation optimized for computational speed and memory efficiency. When these design choices conflict with one another, our team tends to err on the side of choosing a representation that is easy-to-understand for users even at the expense of small performance costs. Ultimately, this means that {tidytof} may not be the optimal tool for every high-dimensional cytometry analysis, though hopefully its general framework will provide most users with some useful functionality.

Supplementary Note 2 -Example {tidytof} workflows
In this note, we provide some example code and output demonstrating how to use {tidytof} to perform common CyTOF data analysis tasks. For additional, exhaustive details for each of {tidytof}'s functions, see the package website.
Finally, we can join this metadata with our single-cell tof_tbl to obtain the cleaned dataset.
citrus_data <-citrus_data |> left_join(citrus_metadata, by = "sample_id") After these data cleaning steps, we now have citrus_data, a tof_tbl containing cells collected from 8 patients. Specifically, 2 samples were taken from each patient: one in which the cells' B-cell receptors were stimulated ("BCR-XL") and one in which they were not ("Basal"). In citrus_data, each cell's patient of origin is stored in the patient column, and each cell's stimulation condition is stored in the stimulation column. In addition, the population_id column stores information about cluster labels that were applied to each cell using a combination of FlowSOM clustering and manual merging (for details, run ?HDCytoData::Bodenmiller_BCR_XL in the R console).

Analyzing single-cell data
As a tof_tbl, citrus_data has access to all {dplyr} functions for general data cleaning. For example, we might want to remove some columns that are either redundant or represent measurements that we're not interested in using the select() function. In addition, we might want to convert some categorical columns initially represented as numeric vectors into character vectors so that we don't accidentally use them for computation using the mutate() function. We can also rename each of our protein measurement columns to omit the metal names using the rename() function in order to make column selection a bit easier.